Fall R ’23
Dominic Bordelon, Research Data Librarian
University Library System, University of Pittsburgh
dbordelon@pitt.edu
Services for the Pitt community:
Support areas and interests:
| # | Date | Title |
|---|---|---|
| 1 | 8/29 | Getting Started with Tabular Data |
| 2 | 9/5 | Working with Data Frames |
| 3 | 9/12 | Data Visualization |
| 4 | 9/19 | Inference and Modeling Intro |
| 5 | 9/26 | Machine Learning Intro |
Probability distributions in R are typically described by four functions. The examples here are for the Normal distribution.
All have mean and sd arguments.
dnorm(x) for density (height of curve) at \(x\)pnorm(q) for getting a probability \(p\) value at some \(q\), i.e., \(P(X \le q)\)qnorm(p) for finding the quantile of some \(p\)rnorm(n) for generating a random sample of size \(n\) from \(N(\mu, \sigma)\)| Distribution | Name in functions | Example function | Applications |
|---|---|---|---|
| Normal (Gaussian) | norm |
pnorm() |
Numerous (often assumed) |
| Student’s \(t\) | t |
pt() |
Estimating parameters without knowing \(\sigma\); \(t\)-tests |
| Binomial (Bernoulli) | binom |
pbinom() |
Number of successes in size trials; logistic regression |
| Chi-squared (\(\chi^2\)) | chisq |
pchisq() |
Chi-squared tests (goodness of fit in a 2-way table) |
| \(F\) | f |
pf() |
ANOVA; \(F\)-tests (model goodness-of-fit) |
Hypothesis tests in base R tend to follow this format:
t.test() can also be found by calculating the \(t\) test statistic, followed by pt() to obtain a \(p\)One-sample \(t\)-test
“Is sample mean \(\bar{x}\) different from population mean \(\mu\)?”
\[ t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \]
mu <- 4147
x_bar <- mean(penguins$body_mass_g, na.rm=TRUE)
s <- sd(penguins$body_mass_g, na.rm=TRUE)
n <- penguins[!is.na(penguins$body_mass_g), "body_mass_g"] %>%
pull() %>%
length()
t <- (x_bar - mu) / (s / sqrt(n))
# find p for this t score, on t distribution with n-1 degrees of freedom:
pt(q = t,
df = (n - 1),
lower.tail = TRUE)
# or:
t.test(x = penguins$body_mass_g,
mu = mu)Two-sample \(t\)-test (independent)
“Is sample mean \(\bar{X}_1\) different from sample mean \(\bar{X}_2\)?”
\[ t = \frac{\bar{X}_1-\bar{X}_2}{s_p \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \]
“Is the conditional distribution across two categorical variables equal, or varied?”
chisq.test(x, y = NULL) where x is a matrix, or x and y are vectors
chisq_test(formula) where formula is written in the format response ~ explanatory
⚠ This is not a good example of a chi-squared test, statistically speaking: our data do not meet the requirements for the test! But the example is left here because R will let us do it! (and it is syntactically correct)
infert <- penguins %>%
specify(response = body_mass_g) %>%
hypothesize(null = "point", mu = 4147) %>%
calculate(stat = "t") %>%
pull()
penguins %>%
specify(response = body_mass_g) %>%
assume(distribution = "t") %>%
visualize() +
shade_p_value(obs_stat = t, direction = "greater")
penguins %>%
t_test(response = body_mass_g,
mu = 4147)
penguins %>%
filter(species %in% c("Adelie", "Chinstrap")) %>%
t_test(formula = body_mass_g ~ species)“What is the relationship between \(X\) and \(Y\)?”
Considering
\[ Y = \beta_0 + \beta_1X + \epsilon \]
where \(\beta_0\) is the \(y\) intercept, \(\beta_1\) is the slope coefficient, \(X\) is the explanatory or predictor variable, \(\epsilon\) is irreducible error, and \(Y\) is the response variable. We want to estimate (or “fit”) \(\hat{\beta}_0\) and \(\hat{\beta}_1\) such that the sum of squared residuals is minimized as much as possible.
lm(formula, data) fits a linear model and returns a model object, where formula is written response ~ explanatory and data is a data frame.
predict() outputs from a model“For some \(x\), what \(\hat{y}\) does the model predict?”
Models in R implement behavior for the predict() function, which supplies a data frame of input values \(X\) and returns a data frame containing corresponding \(Y\) values. The input data frame needs variable(s) of the same name(s) as the predictor(s). Or you can omit the data frame and get the fitted values.
💡 predict() functions are found in documentation under article names like predict.lm, even though the function call is predict().
parsnipToday we learned about:
Join us next week for machine learning intro!
R 4: Inference and Modeling Intro